ABSTRACT

There are many on-line settings in which users publicly express
opinions. A number of these offer mechanisms for other users 
to evaluate these opinions; a canonical example is Amazon.com, 
where reviews come with annotations like "26 of 32 people found 
the following review helpful." Opinion evaluation appears in many 
off-line settings as well, including market research and political 
campaigns. Reasoning about the evaluation of an opinion is funda- 
mentally different from reasoning about the opinion itself: rather 
than asking, "What did Y think of X?", we are asking, "What did Z 
think of Y's opinion of X?" Here we develop a framework for an- 
alyzing and modeling opinion evaluation, using a large-scale col- 
lection of Amazon book reviews as a dataset. We find that the per- 
ceived helpfulness of a review depends not just on its content but 
also in subtle ways on how the expressed evaluation relates
to other evaluations of the same product. As part of our approach, 
we develop novel methods that take advantage of the phenomenon 
of review "plagiarism" to control for the effects of text in opin- 
ion evaluation, and we provide a simple and natural mathematical 
model consistent with our findings. Our analysis also allows us 
to distinguish among the predictions of competing theories from 
sociology and social psychology, and to discover unexpected dif- 
ferences in the collective opinion-evaluation behavior of user pop- 
ulations from different countries. 

Categories and Subject Descriptors: H.2.8 [Database Manage- 
ment]: Database Applications - Data Mining 

General Terms: Measurement, Theory 

Keywords: Review helpfulness, review utility, social influence, 
online communities, sentiment analysis, opinion mining, plagia- 
rism. 

1. INTRODUCTION 

Understanding how people's opinions are received and evaluated 
is a fundamental problem that arises in many domains, such as in 
marketing studies of the impact of reviews on product sales, or in 
political science models of how support for a candidate depends
on the views he or she expresses on different topics. This issue 
is also increasingly important in the user interaction dynamics of
large participatory Web sites. 

Here we develop a framework for understanding and modeling 
how opinions are evaluated within on-line communities. The problem
is related to the lines of computer-science research on opinion,
sentiment, and subjective content [18], but with a crucial twist in 
its formulation that makes it fundamentally distinct from that body 
of work. Rather than asking questions of the form "What did Y 
think of X?", we are asking, "What did Z think of Y's opinion 
of X?" Crucially, there are now three entities in the process rather 
than two. Such three-level concerns are widespread in everyday 
life, and integral to any study of opinion dynamics in a commu- 
nity. For example, political polls will more typically ask, "How do 
you feel about Barack Obama's position on taxes?" than "How do 
you feel about taxes?" or "What is Barack Obama's position on 
taxes?" (though all of these are useful questions in different con- 
texts). Also, Heider's theory of structural balance in social psy- 
chology seeks to understand subjective relationships by consider- 
ing sets of three entities at a time as the basic unit of analysis. But 
there has been relatively little investigation of how these three-way 
effects shape the dynamics of on-line interaction, and this is the
topic we consider here. 

The Helpfulness of Reviews. The evaluation of opinions takes 
place at very large scales every day at a number of widely-used 
Web sites. Perhaps most prominently it is exemplified by one of the 
largest online e-commerce providers, Amazon.com, whose web- 
site includes not just product reviews contributed by users, but also 
evaluations of the helpfulness of these reviews. (These consist of 
annotations that say things like, "26 of 32 people found the follow- 
ing review helpful", with the corresponding data-gathering ques- 
tion, "Was this review helpful to you?") Note that each review on 
Amazon thus comes with both a star rating (the number of
stars it assigns to the product) and a helpfulness vote (the
information that a out of b people found the review itself helpful).
(See Figure 4 for two examples.) This distinction reflects precisely 
the kind of opinion evaluation we are considering: in addition to 
the question "what do you think of book X?", users are also being 
asked "what do you think of user Y's review of book X?" A large- 
scale snapshot of Amazon reviews and helpfulness votes will form 
the central dataset in our study, as detailed below. 

The factors affecting human helpfulness evaluations are not well 
understood. There has been a small amount of work on automatic
determination of helpfulness, treating it as a classification or
regression problem with Amazon helpfulness votes providing labeled
data [10, 15, 17]. Some of this research has indicated that the help- 
fulness votes of reviews are not necessarily strongly correlated with 
certain measures of review quality; for example, Liu et al. found
that when they provided independent human annotators with Ama- 
zon review text and a precise specification of helpfulness in terms 
of the thoroughness of the review, the annotators' evaluations dif- 
fered significantly from the helpfulness votes observed on Amazon. 

All of this suggests that there is in fact a subtle relationship be- 
tween two different meanings of "helpfulness": helpfulness in the 
narrow sense — does this review help you in making a purchase 
decision? — and helpfulness "in the wild," as defined by the way 
in which Amazon users evaluate each other's reviews in practice.
It is a kind of dichotomy familiar from the design of participatory 
Web sites, in which a presumed design goal — that of highlighting 
reviews that are helpful in the purchase process — becomes inter- 
twined with complex social feedback mechanisms. If we want to 
understand how these definitions interact with each other, so as to 
assist users in interpreting helpfulness evaluations, we need to elu- 
cidate what these feedback mechanisms are and how they affect the 
observed outcomes. 

The present work: Social mechanisms underlying helpfulness 
evaluation. In this paper, we formulate and assess a set of theo- 
ries that govern the evaluation of opinions, and apply these to a 
dataset consisting of over four million reviews of roughly 675,000 
books on Amazon's U.S. site, as well as smaller but comparably- 
sized corpora from Amazon's U.K., Germany, and Japan sites. The 
resulting analysis provides a way to distinguish among competing 
hypotheses for the social feedback mechanisms at work in the eval- 
uation of Amazon reviews: we offer evidence against certain of 
these mechanisms, and show how a simple model can directly ac- 
count for a relatively complex dependence of helpfulness on re- 
view and group characteristics. We also use a novel experimental 
methodology that takes advantage of the phenomenon of review 
"plagiarism" to control for the text content of the reviews, enabling 
us to focus exclusively on factors outside the text that affect help- 
fulness evaluation. 

In our initial exploration of non-textual factors that are corre- 
lated with helpfulness evaluation on Amazon, we found a broad 
collection of effects at varying levels of strength.^1 A significant and
particularly wide-ranging set of effects is based on the relationship 
of a review's star rating to the star ratings of other reviews for the 
same product. We view these as fundamentally social effects, given 
that they are based on the relationship of one user's opinion to the 
opinions expressed by others in the same setting.^2

^1 For example, on the U.S. Amazon site, we find that reviews from
authors with addresses in U.S. territories outside the 50 states get
consistently lower helpfulness votes. This is a persistent effect
whose possible bases lie outside the scope of the present paper,
but it illustrates the ways in which non-textual factors can be
correlated with helpfulness evaluations. Previous work has also noted
that longer reviews tend to be viewed as more helpful; ultimately
it is a definitional question whether review length is a textual or
non-textual feature of the review.

^2 A contrarian might put forth the following non-socially-based
alternative hypothesis: the people evaluating review helpfulness are
considering actual product quality rather than other reviews, but
aggregate opinion happens to coincide with objective product quality.
This hypothesis is not consistent with our experimental results.
However, in future work it might be interesting to directly control
for product quality.

Research in the social sciences provides a range of well-studied
hypotheses for how social effects influence a group's reaction to an
opinion, and these provide a valuable starting point for our analysis 
of the Amazon data. In particular, we consider the following three 
broad classes of theories, as well as a fourth straw-man hypothesis 
that must be taken into account. 

(i) The conformity hypothesis. One hypothesis, with roots in the 
social psychology of conformity [4], holds that a review is 
evaluated as more helpful when its star rating is closer to the 
consensus star rating for the product — for example, when 
the number of stars it assigns is close to the average number 
of stars over all reviews. 

(ii) The individual-bias hypothesis. Alternately, one could hy- 
pothesize that when a user considers a review, he or she will 
rate it more highly if it expresses an opinion that he or she 
agrees with.^3 Note the contrasts and similarities with the previous
hypothesis: rather than evaluating whether a review is
close to the mean opinion, a user evaluates whether it is close 
to their own opinion. At the same time, one might expect that 
if a diverse range of individuals apply this rule, then the over- 
all helpfulness evaluation could be hard to distinguish from 
one based on conformity; this issue turns out to be crucial, 
and we explore it further below. 

(iii) The brilliant-but-cruel hypothesis. The name of this hypoth- 
esis comes from studies performed by Amabile [3] that sup- 
port the argument that "negative reviewers [are] perceived 
as more intelligent, competent, and expert than positive re- 
viewers." One can recognize everyday analogues of this phe- 
nomenon; for example, in a research seminar, a dynamic may 
arise in which the nastiest question is consistently viewed as 
the most insightful. 

(iv) The quality-only straw-man hypothesis. Finally, there is a 
challenging methodological complication in all these styles 
of analysis: without specific evidence, one cannot dismiss 
out of hand the possibility that helpfulness is being evalu- 
ated purely based on the textual content of the reviews, and 
that these non-textual factors are simply correlates of textual 
quality. In other words, it could be that people who write 
long reviews, people who assign particular star ratings in 
particular situations, and people from Massachusetts all sim- 
ply write reviews that are textually more helpful — and that 
users performing helpfulness evaluations are simply reacting 
to the text in ways that are indirectly reflected in these other 
features. Ruling out this hypothesis requires some means of 
controlling for the text of reviews while allowing other fea- 
tures to vary, a problem that we also address below. 

We now consider how data on star ratings and helpfulness votes 
can support or contradict these hypotheses, and what it says about 
possible underlying social mechanisms. 

^3 Such a principle is also supported by structural balance considerations
from social psychology; due to space limitations, we omit
a discussion of this here.

Deviation from the mean. A natural first measure to investigate
is the relationship of a review's star rating to the mean star rating of
all reviews for the product; this, for example, is the underpinning
of the conformity hypothesis. With this in mind, let us define the
helpfulness ratio of a review to be the fraction of evaluators who
found it to be helpful (in other words, it is the fraction a/b when a
out of b people found the review helpful), and let us define the
product average for a review of a given product to be the average star
rating given by all reviews of that product. We find (Figure 1) that
the median helpfulness ratio of reviews decreases monotonically as
a function of the absolute difference between their star rating and the
product average. (The same trend holds for other quantiles.) In fact 
the dependence is surprisingly smooth, with even seemingly sub- 
tle changes in the differences from the average having noticeable 
effects. 
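
To make the two definitions concrete, the following minimal Python sketch computes the helpfulness ratio and the product average, then takes the median helpfulness ratio per bin of absolute deviation. The record layout, the toy numbers, and the 0.5-star bin width are illustrative assumptions; none of them is specified in the text.

```python
from collections import defaultdict
from statistics import mean, median

# Toy records (hypothetical): (product_id, star_rating, helpful_votes, total_votes)
reviews = [
    ("book-A", 5, 9, 10),
    ("book-A", 4, 18, 20),
    ("book-A", 1, 3, 12),
    ("book-B", 3, 8, 10),
    ("book-B", 3, 14, 16),
]

def helpfulness_ratio(helpful, total):
    """Fraction a/b when a out of b evaluators found the review helpful."""
    return helpful / total

def product_averages(records):
    """Average star rating over all reviews of each product."""
    stars = defaultdict(list)
    for product, rating, _, _ in records:
        stars[product].append(rating)
    return {product: mean(values) for product, values in stars.items()}

def median_ratio_by_abs_deviation(records, bin_width=0.5):
    """Median helpfulness ratio per bin of |star rating - product average|."""
    averages = product_averages(records)
    bins = defaultdict(list)
    for product, rating, helpful, total in records:
        deviation = abs(rating - averages[product])
        # Assumed binning: snap the deviation to the nearest multiple of bin_width.
        bin_key = round(deviation / bin_width) * bin_width
        bins[bin_key].append(helpfulness_ratio(helpful, total))
    return {key: median(values) for key, values in sorted(bins.items())}

result = median_ratio_by_abs_deviation(reviews)
```

On real data one would look at whether the values of `result` fall as the bin key grows, which is the monotone decline described above.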

This finding on its own is consistent with the conformity hypoth- 
esis: reviews in aggregate are deemed more helpful when they are 
close to the product average. However, a closer look at the data 
raises complications, as we now see. First, to assess the
brilliant-but-cruel hypothesis, it is natural to look not at the absolute
difference between a review's star rating and its product average, but
at the signed difference, which is positive or negative depending 
on whether the star rating is above or below the average. Here 
we find something a bit surprising (Figure 2). Not only does the 
median helpfulness as a function of signed difference fall away on 
both sides of 0; it does so asymmetrically: slightly negative reviews 
are punished more strongly, with respect to helpfulness evaluation, 
than slightly positive reviews. In addition to being at odds with the 
brilliant-but-cruel hypothesis for Amazon reviews, this observation 
poses problems for the conformity hypothesis in its pure form. It 
is not simply that closeness to the average is rewarded; among re- 
views that are slightly away from the mean, there is a bias toward 
overly positive ones. 

Variance and individual bias. One could, of course, amend the 
conformity hypothesis so that it becomes a "conformity with a ten- 
dency toward positivity" hypothesis. But this would beg the ques- 
tion; it wouldn't suggest any underlying mechanism for where the 
favorable evaluation of positive reviews is coming from. Instead, to 
look for such a mechanism, we consider versions of the individual- 
bias hypothesis. Now, recall that it can be difficult to distinguish 
conformity effects from individual-bias effects in a domain such as 
ours: if people's opinions (i.e., star ratings) for a product come from 
a single-peaked distribution with a maximum near the average, then
the composite of their individual biases can produce overall help- 
fulness votes that look very much like the results of conformity. We 
therefore seek out subsets of the products on which the two effects 
might be distinguishable, and the argument above suggests starting 
with products that exhibit high levels of individual variation in star 
ratings. 

In particular, we associate with each product the variance of the 
star ratings assigned to it by all its reviews. We then group products 
by variance, and perform the signed-difference analysis above on 
sets of products having fixed levels of variance. We find (Figure 3) 
that the effect of signed difference to the average changes smoothly 
but in a complex fashion as the variance increases. The role of 
variance can be summarized as follows. 

• When the variance is very low, the reviews with the highest 
helpfulness ratios are those with the average star rating. 

• With moderate values of the variance, the reviews evaluated 
as most helpful are those that are slightly above the average 
star rating. 

• As the variance becomes large, reviews with star ratings both 
above and below the average are evaluated as more helpful 
than those that have the average star rating (with the positive 
reviews still deemed somewhat more helpful). 

These principles suggest some qualitative "rules" for how, all
other things being equal, one can seek good helpfulness evaluations
in our setting: with low variance, go with the average; with
moderate variance, be slightly above average; and with high
variance, avoid the average.

This qualitative enumeration of principles initially seems to be 
fairly elaborate; but as we show in Section 5, all these principles are 
consistent with a simple model of individual bias in the presence of 
controversy. Specifically, suppose that opinions are drawn from a 
mixture of two single-peaked distributions — one with larger mix- 
ing weight whose mean is above the overall mean of the mixture, 
and one with smaller mixing weight whose mean is below it. Now 
suppose that each user has an opinion from this mixture, corre- 
sponding to their own personal score for the product, and they eval- 
uate reviews as helpful if the review's star rating is within some 
fixed tolerance of their own. We can show that in this model, as 
variance increases from 0, the reviews evaluated as most helpful 
are initially slightly above the overall mean, and eventually a "dip" 
in helpfulness appears around the mean. 
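
The model can be explored numerically. The sketch below is only an illustration under stated assumptions: it uses Gaussian components with a common standard deviation (the text requires only single-peaked distributions), and every parameter value is hypothetical. For a review with star rating r, it computes the probability that a random user's opinion lies within a fixed tolerance of r, which under the model is that review's expected helpfulness ratio.

```python
import math

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a Gaussian N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def expected_helpfulness(r, weight_pos, mu_pos, mu_neg, sigma, tolerance):
    """P(|opinion - r| <= tolerance) under a two-component Gaussian mixture.

    weight_pos is the mixing weight of the component with the higher mean.
    The Gaussian shape and all parameter values are assumptions for illustration.
    """
    def mass(mu):
        return normal_cdf(r + tolerance, mu, sigma) - normal_cdf(r - tolerance, mu, sigma)
    return weight_pos * mass(mu_pos) + (1.0 - weight_pos) * mass(mu_neg)

def mixture_mean(weight_pos, mu_pos, mu_neg):
    """Overall mean of the two-component mixture."""
    return weight_pos * mu_pos + (1.0 - weight_pos) * mu_neg

def helpfulness_curve(weight_pos, mu_pos, mu_neg, sigma=0.5, tolerance=0.5, step=0.25):
    """Expected helpfulness at signed deviations from the mixture mean."""
    center = mixture_mean(weight_pos, mu_pos, mu_neg)
    offsets = [i * step for i in range(-12, 13)]
    return {d: expected_helpfulness(center + d, weight_pos, mu_pos, mu_neg, sigma, tolerance)
            for d in offsets}

# Components close together: little controversy, low variance.
low = helpfulness_curve(weight_pos=0.6, mu_pos=4.2, mu_neg=3.8)
# Components moderately separated.
moderate = helpfulness_curve(weight_pos=0.6, mu_pos=4.4, mu_neg=3.2)
# Components far apart: strong controversy, high variance.
high = helpfulness_curve(weight_pos=0.6, mu_pos=4.5, mu_neg=1.5)

best_low = max(low, key=low.get)
best_moderate = max(moderate, key=moderate.get)
best_high = max(high, key=high.get)
```

With these hypothetical parameters, the most helpful signed deviation on the grid moves from zero to slightly above zero as the components separate. In the far-apart case, the expected helpfulness at the mean is lower than at either component, with the positive side remaining higher.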

Thus, a simple model can in principle account for the fairly com- 
plex series of effects illustrated in Figure 3, and provide a hypoth- 
esis for an underlying mechanism. Moreover, the effects we see 
are surprisingly robust as we look at different national Amazon 
sites for the U.K., Germany, and Japan. Each of these commu- 
nities has evolved independently, but each exhibits the same set of 
patterns. The one non-trivial and systematic deviation from the pattern
among these four countries is in the analogue of Figure 3 for
Japan: as with the other countries, a "dip" appears at the average
in the high-variance case, but in Japan the portion of the curve below
the average is higher. This would be consistent with a version
of our two-distribution individual-bias model in which the distri- 
bution below the average has higher mixing weight — representing 
an aspect of the brilliant-but-cruel hypothesis in this individual-bias 
framework, and only for this one national version of the site. 

Controlling for text: Taking advantage of "plagiarism". Fi- 
nally, we return to one further issue discussed earlier: how can we 
offer evidence that these non-textual features aren't simply serving 
as correlates of review-quality features that are intrinsic to the text 
itself? In other words, are there experiments that can address the 
quality-only straw man hypothesis above? 

To deal with this, we make use of rampant "plagiarism" and du- 
plication of reviews on Amazon.com (the causes and implications 
of this phenomenon are beyond the scope of this paper). This is 
a fact that has been noted and studied by earlier researchers [7], 
and for most applications it is viewed as a pathology to be reme- 
died. But for our purposes, it makes possible a remarkably effec- 
tive way to control for the effect of review text. Specifically, we 
define a "plagiarized" pair of reviews to be two reviews of differ- 
ent products with near-complete textual overlap, and we enumer- 
ate the several thousand instances of plagiarized pairs on Amazon. 
(We distinguish these from reviews that have been cross-posted by 
Amazon itself to different versions of the same product.) 

Not only are the two members of a "plagiarized" pair associated 
with different products; very often they also have significantly dif- 
ferent star ratings and are being used on products with different 
averages and variances. (For example, one copy of the review may 
be used to praise a book about the dangers of global warming while 
the other copy is used to criticize a book that is favorable toward 
the oil industry). We find significant differences in the helpfulness 
ratios within plagiarized pairs, and these differences confirm many 
of the effects we observe on the full dataset. Specifically, within
a "plagiarized" pair, the copy of the review that is closer to the av- 
erage gets the higher helpfulness ratio in aggregate. 
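
The pair comparison can be sketched as follows. The text does not say how "near-complete textual overlap" was measured, so this snippet substitutes the similarity ratio from Python's difflib with an arbitrary 0.9 threshold; the review records and field names are likewise invented for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical records: text, product, star rating, product average, helpful and total votes.
reviews = [
    {"text": "A gripping read from start to finish, highly recommended.",
     "product": "book-A", "stars": 5, "product_avg": 4.6, "helpful": 18, "total": 20},
    {"text": "A gripping read from start to finish, highly recommended!",
     "product": "book-B", "stars": 5, "product_avg": 2.9, "helpful": 6, "total": 15},
    {"text": "Dry and repetitive; I could not finish it.",
     "product": "book-C", "stars": 2, "product_avg": 3.8, "helpful": 7, "total": 11},
]

def is_plagiarized_pair(a, b, threshold=0.9):
    """Two reviews of different products with near-complete textual overlap.

    The difflib ratio and the threshold are stand-ins, not the study's method.
    """
    if a["product"] == b["product"]:
        return False
    return SequenceMatcher(None, a["text"], b["text"]).ratio() >= threshold

def closer_copy_wins(a, b):
    """True if the copy nearer its product average has the higher helpfulness ratio."""
    def deviation(r):
        return abs(r["stars"] - r["product_avg"])
    def ratio(r):
        return r["helpful"] / r["total"]
    closer, farther = sorted((a, b), key=deviation)
    return ratio(closer) > ratio(farther)

# Enumerate candidate pairs, then count how often the closer copy does better.
pairs = [(a, b) for i, a in enumerate(reviews) for b in reviews[i + 1:]
         if is_plagiarized_pair(a, b)]
wins = sum(closer_copy_wins(a, b) for a, b in pairs)
```

Because both copies share essentially the same text, any systematic gap between `wins` and `len(pairs) - wins` cannot be attributed to textual quality.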

Thus the widespread copying of reviews provides us with a way 
to see that a number of social feedback effects (based on the
score of a review and its relation to other scores) lead to different
outcomes even for reviews that are textually close to identical. 

Further related work. We also mention some relevant prior lit- 
erature that has not already been discussed above. The role of so- 
cial and cognitive factors in purchasing decision-making has been 
extensively studied in psychology and marketing [6, 8, 9, 21], re- 
cently making use of brain imaging methodology [16]. Character- 
istics of the distribution of review star ratings (which differ from 
helpfulness votes) on Amazon and related sites have been studied 
previously [5, 13, 23]. Categorizing text by quality has been pro- 
posed for a number of applications [1, 12, 14, 19]. Additionally, 
our notion of variance is potentially related to the idea that people 
play different roles in on-line discussion [22]. 

2. DATA 

Our experiments employed a dataset of over 4 million Ama- 
zon.com book reviews (corresponding to roughly 675,000 books), 
of which more than 1 million received at least 10 helpfulness votes 
each. We made extensive use of the Amazon Associates Webservice
(AWS) API to collect this data.^4 We describe the process in
this section, with particular attention to measures we took to avoid 
sample bias. 

We would ideally have liked to work with all book reviews posted 
to Amazon. However, one can only access reviews via queries 
specifying particular books by their Amazon product ID, or ASIN 
(which is the same as ISBN for most books), and we are not aware 
of any publicly available list of all Amazon book ASINs. However, 
the API allows one to query for books in a specific category (called 
a browse-node in AWS parlance and corresponding to a section on 
the Amazon.com website), and the best-selling titles up to a limit 
of 4000 in each browse-node can be obtained in this way. 

To create our initial list of books, therefore, we performed queries 
for all 3855 categories three levels deep in the Amazon browse-node
hierarchy (actually a directed acyclic graph) rooted at "Books →
Subjects". An example category is Children's Books → Animals →
Lions, Tigers & Leopards. These queries resulted in the initial set
of 3,301,940 books, where we count books listed in multiple cate- 
gories only once. 

We then performed a book-filtering step to deal with "cross- 
posting" of reviews across versions. When Amazon carries dif- 
ferent versions of the same item — for example, different editions 
of the same book, including hardcover and softcover editions and 
audio-books — the reviews written for all versions are merged 
and displayed together on each version's product page and likewise
returned by the API upon queries for any individual version.^5
This means that multiple copies of the same review exist for
"mechanical", as opposed to user-driven, reasons.^6 To avoid including
mechanically-duplicated reviews, we retained only one of the set 
of alternate versions for each book (the one with the most complete 
metadata). 

^4 We used the AWS API version 2008-04-07. Documentation
is available at
http://docs.amazonwebservices.com/AWSECommerceService/2008-04-07/DG/

^5 At the time of data collection, the API did not provide an option
to trace a review to a particular edition for which it was originally
posted, despite the fact that the Web store front-end has included
such links for quite some time.

^6 We make use of human-instigated review copying later in this
study.

The above process gave us a list of 674,018 books for which we
retrieved reviews by querying AWS. Although AWS restricts the
number of reviews returned for any given product query to a maximum
of 100, it turned out that 99.3% of our books had 100 or
fewer reviews. In the case of the remaining 4664 books, we chose 
to retrieve the 100 earliest reviews for each product to be able to 
reconstruct the information available to the authors and readers of 
those reviews to the extent possible. (Using the earliest reviews 
ensures the reproducibility of our results, since the 100 earliest re- 
views comprise a static set, unlike the 100 most helpful or recent re- 
views.) As a result, we ended up with 4,043,103 reviews; although 
some reviews were not retrieved due to the 100-reviews-per-book 
API cap, the number of missing reviews averages out to roughly 
just one per ASIN queried. Finally, we focused on the 1,008,466 
reviews that had at least 10 helpfulness votes each. 
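
The two selection rules in this paragraph (the 100 earliest reviews per book, then reviews with at least 10 helpfulness votes) can be sketched as below. The snippet does not call the AWS API; the field names and sample records are hypothetical.

```python
from datetime import date

MAX_REVIEWS_PER_BOOK = 100   # per-product cap described in the text
MIN_HELPFULNESS_VOTES = 10   # threshold for the analysed subset

def earliest_reviews(reviews, limit=MAX_REVIEWS_PER_BOOK):
    """Keep the `limit` earliest reviews of one book, ordered by posting date."""
    return sorted(reviews, key=lambda r: r["posted"])[:limit]

def with_enough_votes(reviews, minimum=MIN_HELPFULNESS_VOTES):
    """Keep reviews that received at least `minimum` helpfulness votes."""
    return [r for r in reviews if r["total_votes"] >= minimum]

# Hypothetical reviews for a single book.
book_reviews = [
    {"id": "r3", "posted": date(2007, 5, 1), "total_votes": 4},
    {"id": "r1", "posted": date(2005, 1, 9), "total_votes": 32},
    {"id": "r2", "posted": date(2006, 3, 17), "total_votes": 10},
]

# A limit of 2 is used here only so the cap is visible on three records.
kept = earliest_reviews(book_reviews, limit=2)
analysed = with_enough_votes(kept)
```

Sorting by posting date before truncating is what makes the retained set static: later reviews can never displace earlier ones.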

The size of our dataset compares favorably to that of collections 
used in other studies looking at helpfulness votes: Liu et al. [17] 
used about 23,000 digital camera reviews (of which a subset of 
around 4900 were subsequently given new helpfulness votes and 
studied more carefully); Zhang and Varadarajan [24] used about 
2500 reviews of electronics, engineering books, and PG-13 movies 
after filtering out duplicate reviews and reviews with no more than 
10 helpfulness votes; Kim et al. [15] used about 26,000 MP3 and 
digital-camera reviews after filtering of duplicate versions and du- 
plicate reviews and reviews with fewer than 5 helpfulness votes; 
and Ghose and Ipeirotis [11] considered "all reviews since the product
was released into the market" (no specific number is given) for
about 400 popular audio and video players, digital cameras, and 
DVDs. 

3. EFFECTS OF DEVIATION FROM AVERAGE AND VARIANCE

Several of the hypotheses that we have described concern the 
relative position of an opinion about an entity vis-a-vis the average 
opinion about that entity. We now turn, therefore, to the question 
of how the helpfulness ratio of a review depends on its star rating's 
deviation from the average star rating for all reviews of the same 
book. According to the conformity hypothesis, the helpfulness ra- 
tio should be lower for reviews with star ratings either above or be- 
low the product average, whereas the brilliant-but-cruel hypothesis 
translates to the "asymmetric" prediction that the helpfulness ratio 
should be higher for reviews with star ratings below the product 
average than for overly positive reviews. (No specific predictions 
for helpfulness ratio vis-a-vis product average are made by either the
individual-bias or quality-only hypothesis without further assump- 
tions about the distribution of individual opinions or text quality.) 

Defining the average. For a given review, let the computed
product-average star rating (abbreviation: computed star average) be the
average star rating as computed over all reviews of that product in 
our dataset. 

This differs in principle from the Amazon-displayed product- 
average star rating (abbreviation: displayed star average), the "Av- 
erage Customer Review" score that Amazon itself displayed for the 
book at the time we downloaded the data. One reason for the differ- 
ence is that Amazon rounds the displayed star average to the nearest 
half-star (e.g., 3.5 or 4.0) — but for our experiments it is preferable 
to have a greater degree of resolution. Another possible source of 
difference is the very small (0.7%) fraction of books, mentioned in 
Section 2, for which the entire set of reviews could not be obtained 
via AWS: the displayed star average would be partially based on 
reviews that came later than the first 100 and which would thus not 
be in our dataset. However, the mean absolute difference between 
the computed star average when rounded to the nearest half-star 
(0.5 increment) and the displayed star average is only 0.02. 
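
The comparison in the last sentence can be reproduced in a few lines. This is a sketch only: how Amazon breaks exact ties when rounding is not stated, so rounding ties upward is an assumption, and the sample averages are invented.

```python
import math

def round_to_half_star(value):
    """Round to the nearest 0.5; exact ties round up (a tie-breaking assumption)."""
    return math.floor(value * 2 + 0.5) / 2

def mean_abs_difference(computed, displayed):
    """Mean |round_to_half_star(computed) - displayed| over paired products."""
    diffs = [abs(round_to_half_star(c) - d) for c, d in zip(computed, displayed)]
    return sum(diffs) / len(diffs)

# Hypothetical computed and displayed star averages for four books.
computed_averages = [3.62, 4.24, 4.26, 2.10]
displayed_averages = [3.5, 4.0, 4.5, 2.5]

gap = mean_abs_difference(computed_averages, displayed_averages)
```

In this toy input only the last book disagrees after rounding, so the gap is driven entirely by that one product.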

Figure 1: Helpfulness ratio declines with the absolute value of
a review's deviation from the computed star average; this behavior
is predicted by the conformity hypothesis but not ruled
out by the other hypotheses.

The line segments within the bars (connected by the descending
line) indicate the median helpfulness ratio; the bars depict
the helpfulness ratio's second and third quantiles.

Throughout, grey bars indicate that the amount of data at
that x value represents 0.1% or less of the data depicted in the
plot.



Note that both scores can differ from the "Average Customer Re- 
view" score that Amazon displayed at the time a helpfulness evalu- 
ator provided their helpfulness vote, since this time might pre-date 
some of the reviews for the book that are in our dataset (and hence 
that Amazon based its displayed star average on). In the absence 
of timestamps on helpfulness votes, this is not a factor that can be 
controlled for. 

Deviation experiments. We first check the prediction of the con- 
formity hypothesis that the helpfulness ratio of a review will vary 
inversely with the absolute value of the difference between the re- 
view's star rating and the computed product- average star rating — 
we call this difference the review's deviation. 

Figure 1 indeed shows a very strong inverse correlation between 
the median helpfulness ratio and the absolute deviation, as pre- 
dicted by the conformity hypothesis. However, this data does not 
completely disprove the brilliant-but-cruel hypothesis, since for a 
given absolute deviation |x| > 0, it could conceivably happen that
reviews with positive deviation |x| (i.e., more favorable than average)
could have much worse helpfulness ratios than reviews with
negative deviation -|x|, thus dragging down the median helpfulness
ratio. Rather, to directly assess the brilliant-but-cruel hypothesis,
we must consider signed deviation, not just absolute deviation.

Surprisingly, the effect of signed deviation on median helpful- 
ness ratio, depicted in the "Christmas-tree" plot of Figure 2, turns 
out to be different from what either hypothesis would predict. 

The brilliant-but-cruel hypothesis clearly does not hold for our 
data: among reviews with the same absolute deviation |x| > 0,
the relatively positive ones (signed deviation |x|) generally have a
higher median helpfulness ratio than the relatively negative ones
(signed deviation -|x|), as depicted by the positive slope of the
green dotted lines connecting (-|x|, |x|) pairs of datapoints.
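
The mirrored comparison just described can be sketched as below: bin reviews by signed deviation, take the median helpfulness ratio per bin, and compare each bin at x with its mirror at -x. The half-star binning and the data points are assumptions for illustration.

```python
from collections import defaultdict
from statistics import median

def median_ratio_by_signed_deviation(points, bin_width=0.5):
    """Median helpfulness ratio per bin of signed deviation.

    `points` holds (signed_deviation, helpfulness_ratio) tuples; snapping to
    the nearest `bin_width` is an assumed binning scheme.
    """
    bins = defaultdict(list)
    for deviation, ratio in points:
        bins[round(deviation / bin_width) * bin_width].append(ratio)
    return {key: median(values) for key, values in bins.items()}

def mirrored_gaps(medians):
    """For each x > 0 with data at both x and -x: median(x) - median(-x).

    Conformity alone predicts gaps near zero, brilliant-but-cruel predicts
    negative gaps, and the text reports generally positive ones.
    """
    return {x: medians[x] - medians[-x]
            for x in sorted(medians) if x > 0 and -x in medians}

# Hypothetical (signed deviation, helpfulness ratio) observations.
points = [(-1.0, 0.55), (-1.1, 0.60), (1.0, 0.70), (0.9, 0.75),
          (0.0, 0.85), (-2.0, 0.40), (2.1, 0.50)]
gaps = mirrored_gaps(median_ratio_by_signed_deviation(points))
```

A positive entry in `gaps` corresponds to an upward-sloping connecting line between the mirrored data points.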

But Figure 2 also presents counter-evidence for the conformity
hypothesis, since that hypothesis incorrectly predicts that the
connecting lines would be horizontal.

Figure 2: The dependence of helpfulness ratio on a review's
signed deviation from average is inconsistent with both the
brilliant-but-cruel and, because of the asymmetry, the conformity
hypothesis.

To account for Figure 2, one could simply impose upon the con- 
formity hypothesis an extra "tendency towards positivity" factor, 
but this would be quite unsatisfactory: it wouldn't suggest any un- 
derlying mechanism for this factor. So, we turn to the individual- 
bias hypothesis instead. 

In order to distinguish between conformity effects and individual- 
bias effects, we need to examine cases in which individual people's 
opinions do not come from exactly the same (single-peaked, say) 
distribution; for otherwise, the composite of their individual biases 
could produce helpfulness ratios that look very much like the re- 
sults of conformity. One natural place to begin to seek settings 
in which individual bias and conformity are distinguishable, in the 
sense just described, is in cases in which there is at least high vari- 
ance in the star ratings. Accordingly, Figure 3 separates products 
by the variance of the star ratings in the reviews for that product in 
our dataset. 

One can immediately observe some striking effects of variance. 
First, we see that as variance increases, the "camel plots" of Figure 
3 go from a single hump to two.^ We also note that while in the pre- 
vious figures it was the reviews with a signed deviation of exactly 
zero that had the highest helpfulness ratios, here we see that once 
the variance among reviews for a product is 3.0 or greater, the highest 
helpfulness ratios are clearly achieved for reviews with signed 
deviations close to but still noticeably above zero. (The beneficial 
effects of having a star rating slightly above the mean are already 
discernible, if small, at variance 1.0 or so.) 

Clearly, these results indicate that variance is a key factor that 
any hypothesis needs to incorporate. In Section 5, we develop a 
simple individual-bias model that does so; but first, there is one last 
hypothesis that we need to consider. 
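
The variance split behind Figure 3 can be sketched as follows. This is a minimal illustration, not the authors' code: the input tuple layout, the function names, the use of population variance, and the round-half-up rule are all assumptions; only the rounding to the nearest .5 increment comes from the text.

```python
import math
import statistics
from collections import defaultdict


def round_to_half(value):
    """Round to the nearest 0.5 increment (ties round up; an assumption)."""
    return math.floor(value * 2 + 0.5) / 2


def median_helpfulness_by_bin(reviews):
    """Map (variance bin, signed-deviation bin) -> median helpfulness ratio.

    `reviews` is an iterable of hypothetical tuples
    (product_id, stars, helpful_votes, total_votes).
    """
    by_product = defaultdict(list)
    for product_id, stars, helpful, total in reviews:
        if total > 0:  # the ratio is undefined for reviews without votes
            by_product[product_id].append((stars, helpful / total))

    ratios = defaultdict(list)
    for product_reviews in by_product.values():
        stars = [s for s, _ in product_reviews]
        mean = statistics.mean(stars)
        # Population variance is an assumption; the text does not say
        # which estimator was used.
        variance_bin = round_to_half(statistics.pvariance(stars))
        for s, ratio in product_reviews:
            # Signed deviation: the review's star rating minus the product mean.
            deviation_bin = round_to_half(s - mean)
            ratios[(variance_bin, deviation_bin)].append(ratio)

    return {key: statistics.median(vals) for key, vals in ratios.items()}


example_reviews = [
    ("A", 5, 8, 10), ("A", 5, 6, 10), ("A", 1, 2, 10), ("A", 1, 4, 10),
    ("B", 4, 9, 10), ("B", 4, 5, 10), ("B", 4, 7, 10),
]
example_bins = median_helpfulness_by_bin(example_reviews)
```

Plotting the medians for one variance bin against the signed-deviation bins would give one panel of a figure in the style of Figure 3.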



^This is a reversal of nature, where Bactrian (two-humped) camels 
are more agreeable than one-humped Dromedaries. 




Figure 3: As the variance of the star ratings of reviews for a particular product increases, the median helpfulness ratio curve becomes 
two-humped and the helpfulness ratio at signed deviation 0 (indicated in red) no longer represents the unique global maximum. There 
are non-zero signed deviations in the plot for variance 0 because we rounded variance values to the nearest .5 increment. 



4. CONTROLLING FOR TEXT QUALITY: 
EXPERIMENTS WITH "PLAGIARISM" 

As we have noted, our analyses do not explicitly take into ac- 
count the actual text of reviews. It is not impossible, therefore, 
that review text quality may be a confounding factor and that our 
straw-man quality-only hypothesis might hold. Specifically, we 
have shown that helpfulness ratios appear to be dependent on two 
key non-textual aspects of reviews, namely, on deviation from the 
computed star average and on star rating variance within reviews 
for a given product; but we have not shown that our results are not 
simply explained by review quality. 

Initially, it might seem that the only way to control for text qual- 
ity is to read a sample of reviews and determine whether the Ama- 
zon helpfulness ratios assigned to these reviews are accurate. Un- 
fortunately, it would require a great deal of time and human effort 
to gather a sufficiently large set of re-evaluated reviews, and human 
re-evaluations can be subjective; so it would be preferable to find a 
more efficient and objective procedure. ^ 



A different potential approach would be to use machine learning 
to train an algorithm to automatically determine the degree of 
helpfulness of each review. Such an approach would indeed involve 
less human effort, and could thus be applied to larger numbers of 
reviews. However, we could not draw the conclusions we would 
want to: any mismatch between the predictions of a trained classifier 
and the helpfulness ratios observed in held-out reviews could 
be attributable to errors by the algorithm, rather than to the actions 
of the Amazon helpfulness evaluators.^ 

^Liu et al. [17] did perform a manual re-evaluation of 4909 
digital-camera reviews, finding that the original helpfulness ratios did not 
seem well-correlated with the stand-alone comprehensiveness of 
the reviews. But note that this could just mean that at least some of 
the original helpfulness evaluators were using a different standard 
of text quality (Amazon does not specify any particular standard 
or definition of helpfulness). Indeed, the exemplary "fair" review 
quoted by Liu et al. begins, "There is nothing wrong with the [product] 
except for the very noticeable delay between pics. [Description 
of the delay.] Otherwise, [other aspects] are fine for anything from 
Internet apps to ... print enlarging. It is competent, not spectacular, 
but it gets the job done at an agreeable price point." Liu et al. 
give this a rating of "fair" because it only comments on some of 
the product's aspects, but the Amazon helpfulness evaluators gave 
it a helpfulness ratio of 5/6, which seems reasonable. Reviews 
might also be evaluated vis-a-vis the totality of all reviews, i.e., a 
review might be rated helpful if it provides complementary information 
or "adds value". For instance, a one-line review that points 
out a serious flaw in another review could well be considered "helpful", 
but would not rate highly under Liu et al.'s scheme. 
It is also worth pointing out that subjectiveness can remain an issue 
even with respect to a given text-only evaluation scheme. The two 
human re-evaluators who used Liu et al.'s [17] standard assigned 
different helpfulness categories (in a four-category framework) to 
619 (= 12.5%) of the reviews considered, indicating that there can be 
substantial subjectiveness involved in determining review quality 
even when a single standard is initially agreed upon. 
^Ghose and Ipeirotis [11] observe that their trained classifier often 
performed poorly for reviews of products with "widely fluctuat- 
ing" star ratings, and explain this with an assertion that the Ama- 
zon helpfulness evaluators are not judging text quality in such sit- 
uations. But there is no evidence provided to dismiss the alterna- 
tive hypothesis that the helpfulness evaluators are correct and that, 



We thus find ourselves in something of a quandary: we seem 
to lack any way to derive a sufficiently large set of objective and 
accurate re-evaluations of helpfulness. Fortunately, we can bring to 
bear on this problem two key insights: 

1. Rather than try to re-evaluate all reviews for their helpful- 
ness, we can focus on reviews that are guaranteed to have 
very similar levels of textual quality. 

2. Amazon data contains many instances of nearly-identical re- 
views [7] — and identical reviews must necessarily exhibit 
the same level of text quality. 

Thus, in the remainder of this section, we consider whether the 
effects we have analyzed above hold on pairs of "plagiarized" re- 
views. 

Identifying "plagiarism" (as distinct from "justifiable copying"). 

Our choice of the term "plagiarism" is meant to be somewhat evoca- 
tive, because we disregard several types of arguably justifiable copy- 
ing or duplication in which there is no overt attempt to make the 
copied review seem to be a genuinely new piece of text; the reason 
is that this kind of copying does not suit our purposes. However, 
ill intent cannot and should not be ascribed to the authors of 
the remaining reviews; we have attempted to indicate this by the 
inclusion of scare quotes around the term. 

In brief, we only considered pairs of reviews where the two re- 
views were posted to different books — this avoids various types 
of relatively obvious self-copying (e.g., where an author reposts a 
review under their user ID after initially posting it anonymously), 
since obvious copies might be evaluated differently. 

We next adapted the code of Sorokina et al. [20] to identify those 
pairs of reviews of different products that have highly similar text. 
To do so, we needed to decide on a similarity threshold that deter- 
mines whether or not we deem a review pair to be "plagiarized". A 
reasonable option would have been to consider only reviews with 
identical text, which would ensure that the reviews in the pairs had 
exactly the same text quality. However, since the reviews in the 
analyzed pairs are posted for different products, it is normal to ex- 
pect that some authors modified or added to the text of the original 
review to make the "plagiarized" copy better fit its new context. 
For this reason, we employed a threshold of 70% or more nearly- 
duplicate sentences, where near-duplication was measured via the 
code of Sorokina et al. [20].^^ This yielded 8,313 "plagiarized" 
pairs; an example is shown in Figure 4. Manual inspection of a 
sample revealed that the review pairs captured by our threshold in- 
deed seem to consist of close copies. 
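
The pair-level threshold can be illustrated with a short sketch. It is not the code of Sorokina et al. [20]: sentence similarity is approximated here with Python's standard `difflib.SequenceMatcher`, and the sentence splitter, the 0.8 sentence-level cutoff, the symmetric `min` over both directions, and all function names are assumptions made for illustration. Only the 70% pair-level threshold comes from the text.

```python
import re
from difflib import SequenceMatcher


def split_sentences(text):
    """Split on sentence-ending punctuation; lowercase and collapse whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [" ".join(p.lower().split()) for p in parts if p.strip()]


def matched_fraction(source, target, sentence_threshold):
    """Fraction of sentences in `source` that have a near-duplicate in `target`."""
    src, tgt = split_sentences(source), split_sentences(target)
    if not src or not tgt:
        return 0.0
    hits = sum(
        1
        for s in src
        if any(SequenceMatcher(None, s, t).ratio() >= sentence_threshold for t in tgt)
    )
    return hits / len(src)


def near_duplicate_fraction(review_a, review_b, sentence_threshold=0.8):
    """Symmetric score: the smaller of the two directional fractions (an assumption)."""
    return min(
        matched_fraction(review_a, review_b, sentence_threshold),
        matched_fraction(review_b, review_a, sentence_threshold),
    )


def is_plagiarized_pair(review_a, review_b, pair_threshold=0.70):
    """Apply the 70% near-duplicate-sentence threshold described in the text."""
    return near_duplicate_fraction(review_a, review_b) >= pair_threshold


review_a = ("If you enjoy a headache, this is for you. "
            "I would recommend this title only as a last resort to learn Chinese. "
            "There are better ways.")
review_b = review_a.replace("Chinese", "Korean")
review_c = ("If you enjoy a headache, this is for you. "
            "The binding fell apart quickly. Shipping took three weeks.")
```

Here `review_a` and `review_b` differ by one word and are flagged as a pair, while `review_c` shares only its opening sentence with `review_a` and is not.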



Confirmation that text quality is not the (only) explanatory fac- 
tor. Since for a given pair of "plagiarized" reviews the text quality 
of the two copies should be essentially the same, a statistically sig- 
nificant difference between the helpfulness ratios of the members 
of such pairs is a strong indicator of the influence of a non-textual 
factor on the helpfulness evaluators. 

An initial test of the data reveals that the mean difference in help- 
fulness ratio between "plagiarized" copies is very close to zero. 

rather, the algorithm makes mistakes because reviews are more 
complex in such situations and the classifier uses relatively shal- 
low textual features. 

^^Kim et al. [15], who also noticed that the phenomenon of review 
alteration affected their attempts to remove duplicate reviews, used 
a similar threshold of 80% repeated bigrams. 



26 of 30 people found the following review helpful: 

Skull-splitting headache guaranteed!!, June 16, 2004 
By A Customer 

If you enjoy a thumping, skull splitting migraine headache, then Sing N Learn is 
for you. 

As a longtime language instructor, I agree with the attempt and effort that this 
series makes, but it is the execution that ultimately weakens Sing N Learn 
Chinese. 

To be sure, there are much, much better ways to learn Chinese. In fact, I would 
recommend this title only as a last resort and after you've thoroughly exhausted 
traditional ways to learn Chinese. . . . 

7 of 11 people found the following review helpful: 

Migraine Headache at No Extra Charge, May 28, 2004 

By A Customer 

If you enjoy a thumping, skull splitting migraine headache, then the Sing N Learn 
series is for you. 

As a longtime language instructor, I agree with the effort that this series makes, 
but it is the execution that ultimately weakens Sing N Learn series. To be sure, 
there are much, much better ways to learn a foreign language. In fact, I would 
recommend this title only as a last resort and after you've thoroughly exhausted 
traditional ways to learn Korean. . . . 

Figure 4: The first paragraphs of "plagiarized" reviews 
posted for the products Sing 'n Learn Chinese and 
Sing 'n Learn Korean. In the second review, the title is 
different and the word "chinese" has been replaced by "korean" 
throughout. Sources: 
http://www.amazon.com/review/RHE2G1M8VOH9N/ref=cm_cr_rdp_perm and 
http://www.amazon.com/review/RQYHTSDUNM732/ref=cm_cr_rdp_perm. 



However, a confounding factor is that for many of our pairs, the 
two copies may occur in contexts that are practically indistinguish- 
able. Therefore, we bin the pairs by how different their absolute 
deviations are, and consider whether helpfulness ratios differ at 
least for pairs with very different deviations. More formally, for 
$i, j \in \{0, 0.5, \ldots, 3.5\}$ where $i < j$, we write $i \succ j$ (conversely, 
$i \prec j$) when the helpfulness ratio of reviews with absolute deviation 
i is significantly larger (conversely, smaller) than that for reviews 
with absolute deviation j. Here, significantly larger or smaller 
means that the Mantel-Haenszel test for whether the helpfulness 
odds ratio is equal to 1 returns a 95% confidence interval that does 
not contain 1.^^ The Mantel-Haenszel test [2] measures the strength 
of association between two groups, giving more weight to groups 
with more data. (Experiments with an alternate empirical sampling 
test were consistent.) We disallow j = 4 since there are only 24 
relevant pairs, which would have to be distributed among 8 (i, j) 
bins. 
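
The significance check for one (i, j) bin can be sketched as below. Several choices here are assumptions, since the text does not spell them out: each "plagiarized" pair is treated as one stratum with a 2x2 table of helpful and unhelpful votes, the pooled odds ratio is the Mantel-Haenszel estimate, and the 95% interval uses the Robins-Breslow-Greenland variance of the log odds ratio. The function names are hypothetical. The rule that an interval overlapping [0.995, 1.005] counts as containing 1 is the one stated in the paper's footnote.

```python
import math


def mantel_haenszel(tables, z=1.96):
    """Pooled odds ratio and confidence interval over 2x2 strata.

    Each table is (a, b, c, d): helpful and unhelpful votes for the copy
    with deviation i (a, b) and for the copy with deviation j (c, d).
    """
    R = S = 0.0
    PR = PS_QR = QS = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum carries no information
        r, s = a * d / n, b * c / n
        p, q = (a + d) / n, (b + c) / n
        R += r
        S += s
        PR += p * r
        PS_QR += p * s + q * r
        QS += q * s
    if R == 0 or S == 0:
        raise ValueError("odds ratio undefined: a pooled cross-product is zero")
    odds_ratio = R / S
    # Robins-Breslow-Greenland variance of log(odds_ratio).
    var_log = PR / (2 * R * R) + PS_QR / (2 * R * S) + QS / (2 * S * S)
    half_width = z * math.sqrt(var_log)
    return odds_ratio, (odds_ratio * math.exp(-half_width),
                        odds_ratio * math.exp(half_width))


def classify_bin(tables, low=0.995, high=1.005):
    """Return 'larger', 'smaller', or 'none' for one (i, j) bin."""
    _, (ci_low, ci_high) = mantel_haenszel(tables)
    if ci_low <= high and ci_high >= low:
        return "none"  # the interval overlaps [0.995, 1.005], so it "contains 1"
    return "larger" if ci_low > high else "smaller"


# Vote counts taken from the two reviews in Figure 4: 26 of 30 and 7 of 11.
figure4_pair = (26, 4, 7, 4)
```

A single pair such as `figure4_pair` gives a wide interval that contains 1; the difference becomes significant only when many pairs in a bin point the same way, which matches the remark that the test gives more weight to groups with more data.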

The existence of even a single pair $(i, j)$ in which $i \succ j$ or $i \prec j$ 
would already be inconsistent with the quality-only hypothesis. Table 1 
shows that in fact, there is a significant difference in a large 
majority of cases. Moreover, we see no "$\prec$" symbols; this is consistent 
with Figure 1, which showed that the helpfulness ratio is 
inversely correlated with absolute deviation prior to controlling for 
text quality. 

We also binned the pairs by signed deviation. The results, shown 
in Table 2, are consistent with Figure 2. First, all but one of the 
statistically significant results indicate that "plagiarized" reviews with 
star rating closer to the product average are judged to be more helpful. 
Second, the $(-i, i)$ results are consistent with the asymmetry 
depicted in Figure 2 (i.e., the "upward slant" of the green lines). 

^^To avoid drawing conclusions based on possible numerical-precision 
inaccuracies, we consider any confidence interval that 
overlaps the interval [0.995,1.005] to contain 1. This "overlap" 
policy affects only two bins in Table 1 and two bins in Table 2. 

Note that the sparsity of the "plagiarism" data precludes an 
analogous investigation of variance as a contextual factor. 

Table 1: "Plagiarized" reviews with a lower absolute deviation 
tend to have larger helpfulness ratios than duplicates with 
higher absolute deviations. Depicted: whether reviews with absolute 
deviation i have a helpfulness ratio significantly larger ($\succ$) or 
significantly smaller ($\prec$; no such cases) than duplicates with 
absolute deviation j (blank: no significant difference). 

5. A MODEL BASED ON INDIVIDUAL BIAS 
AND MIXTURES OF DISTRIBUTIONS 

We now consider how the main findings about helpfulness, vari- 
ance, and divergence from the mean are consistent with a simple 
model based on individual bias with a mixture of opinion distribu- 
tions. In particular, our model exhibits the phenomenon observed in 
our data that increasing the variance shifts the helpfulness distribu- 
tion so it is first unimodal and subsequently (with larger variance) 
develops a local minimum around the mean. 

The model assumes that helpfulness evaluators can come from 
two different distributions: one consisting of evaluators who are 
positively disposed toward the product, and the other consisting of 
evaluators who are negatively disposed toward the product. We will 
refer to these two groups as the positive and negative evaluators 
respectively. 

We need not make specific distributional assumptions about the 
evaluators; rather, we simply assume that their opinions are drawn 
from some underlying distribution with a few basic properties. 
Specifically, let us say that a function $f : \mathbb{R} \to \mathbb{R}$ is $\mu$-centered, for some 
real number $\mu$, if it is unimodal at $\mu$, centrally symmetric, and $C^2$ 
(i.e. it possesses a continuous second derivative). That is, $f$ has a 
unique local maximum at $\mu$, $f'$ is non-zero everywhere other than 
$\mu$, and $f(\mu + x) = f(\mu - x)$ for all $x$. We will assume that both 
positive and negative evaluators have one-dimensional opinions drawn 
from (possibly different) distributions with density functions that 
are $\mu$-centered for distinct values of $\mu$. 

Our model will involve two parameters: the balance between 
positive and negative reviewers $p$, and a controversy level $\alpha > 0$. 
Concretely, we assume that there is a $p$ fraction of positive evaluators 
and a $1 - p$ fraction of negative evaluators. (For notational simplicity, 
we sometimes write $q$ for $1 - p$.) The controversy level controls 
the distance between the means of the positive and negative 
populations: we assume that for some number $\mu$, the density function 
$f$ for positive evaluators is $(\mu + q\alpha)$-centered, and the density 
function $g$ for negative evaluators is $(\mu - p\alpha)$-centered. Thus, the 
density function for the full population is $h(x) = p f(x) + q g(x)$, 
and it has mean $p(\mu + q\alpha) + q(\mu - p\alpha) = \mu$. In this way, our 
parametrization allows us to keep the mean and balance fixed while 
observing the effects as we vary the controversy level $\alpha$. 
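
As a worked check of this parametrization, the mean and variance of the mixture can be written out term by term. The variance line additionally assumes that the two component densities have finite variances, written $\sigma_f^2$ and $\sigma_g^2$ here; these two symbols are introduced only for this illustration and do not appear in the model definition.

```latex
\begin{align*}
\mathbb{E}_h[x] &= p(\mu + q\alpha) + q(\mu - p\alpha)
                 = (p + q)\mu + pq\alpha - pq\alpha = \mu, \\
\operatorname{Var}_h[x] &= p\bigl(\sigma_f^2 + (q\alpha)^2\bigr)
                         + q\bigl(\sigma_g^2 + (p\alpha)^2\bigr) \\
                        &= p\sigma_f^2 + q\sigma_g^2 + pq(p + q)\alpha^2
                         = p\sigma_f^2 + q\sigma_g^2 + pq\,\alpha^2 .
\end{align*}
```

So with the balance held fixed, the variance of the full population grows with the square of the controversy level, and the size of that growth depends on the balance through the factor $pq$.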



Table 2: The same type of analysis as Table 1 but with signed 
deviation. The first (resp. second) table is consistent with the 
lefthand (resp. righthand) side of Figure 2. The third table is 
consistent with the "upward slant" of the green lines in Figure 
2: for the same absolute deviation value, when there is a sig- 
nificant difference in helpfulness odds ratio, the difference is in 
favor of the positive deviation. 

(There are a noticeable number of blank cells, indicating that a sta- 
tistically significant difference was not observed for the corresponding 
bins, due to sparse data issues: there are twice as many bins as in the 
absolute-deviation analysis but the same number of pairs.) 

Now, under our individual-bias assumption, we posit that each 
helpfulness evaluator has an opinion x drawn from h, and each 
regards a review as helpful if it expresses an opinion that is within a 
small tolerance of x. For small tolerances, we expect therefore that 
the helpfulness ratio of reviews giving a score of x, as a function 
of $x$, can be approximated by $h(x)$. Hence, we consider the shape 
of $h(x)$ and ask whether it resembles the behavior of helpfulness 
ratios observed in the real data. 
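
The tolerance argument can be checked with a small simulation. This is a sketch under stated assumptions: the two component densities are taken to be unit-variance Gaussians, and the tolerance, the number of simulated evaluators, the seed, and all function names are illustrative choices rather than anything specified by the model. For a tolerance tau, the simulated helpfulness ratio should be close to 2 * tau * h(x), that is, proportional to h(x).

```python
import math
import random


def normal_pdf(x, mean, sd=1.0):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_density(x, mu, p, alpha, sd=1.0):
    """h(x) = p*f(x) + q*g(x), with f centered at mu + q*alpha and g at mu - p*alpha."""
    q = 1.0 - p
    return p * normal_pdf(x, mu + q * alpha, sd) + q * normal_pdf(x, mu - p * alpha, sd)


def sample_opinion(rng, mu, p, alpha, sd=1.0):
    """Draw one evaluator's opinion from the mixture h."""
    q = 1.0 - p
    if rng.random() < p:
        return rng.gauss(mu + q * alpha, sd)  # positive evaluator
    return rng.gauss(mu - p * alpha, sd)      # negative evaluator


def simulated_helpfulness_ratio(score, tolerance, mu, p, alpha,
                                n_evaluators=200_000, seed=0):
    """Fraction of simulated evaluators whose opinion is within `tolerance` of `score`."""
    rng = random.Random(seed)
    helpful = sum(
        1
        for _ in range(n_evaluators)
        if abs(sample_opinion(rng, mu, p, alpha) - score) <= tolerance
    )
    return helpful / n_evaluators


simulated = simulated_helpfulness_ratio(0.0, 0.1, mu=0.0, p=0.7, alpha=1.0)
predicted = 2 * 0.1 * mixture_density(0.0, mu=0.0, p=0.7, alpha=1.0)
```

Comparing `simulated` with `predicted` at several scores gives a direct view of how closely the helpfulness-ratio curve tracks the shape of the mixture density for a small tolerance.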

Since the controversy level $\alpha$ in our model affects the variance 
in the empirical data ($\alpha$ is the distance between the peaks of the 
two distributions, and is thus related to the variance, but the balance 
$p$ is also a factor), we can hope that as $\alpha$ increases one 
obtains qualitative properties consistent with the data: first a unimodal 
distribution with peak between the means of $f$ and $g$, and 
then a local minimum near the mean of $h$. In fact, this is precisely 
what happens. The main result is the following. 

Theorem 5.1. For any choice of $f$, $g$, and $p$ as defined 
above, there exist positive constants $\varepsilon_0 < \varepsilon_1$ such that 

(i) When $\alpha < \varepsilon_0$, the combined density $h(x)$ is unimodal, with 
maximum strictly between the mean of $f$ and the mean of $g$. 

(ii) When $\alpha > \varepsilon_1$, the combined density function $h(x)$ has a 
local minimum between the means of $f$ and $g$. 

Proof. We first prove (i). Let us write $\mu_f = \mu + q\alpha$ for the mean 
of $f$, and $\mu_g = \mu - p\alpha$ for the mean of $g$. Since $f$ and $g$ have 
unique local maxima at their means, we have $f''(\mu_f) < 0$ and 
$g''(\mu_g) < 0$. Since these second derivatives are continuous, there 
exists a constant $\delta$ such that $f''(x) < 0$ for all $x$ with $|x - \mu_f| < \delta$, 
and $g''(x) < 0$ for all $x$ with $|x - \mu_g| < \delta$. Since $\mu_f - \mu_g = \alpha$, if 
we choose $\alpha < \delta$, then $f''(x)$ and $g''(x)$ are both strictly negative 
over the entire interval $[\mu_g, \mu_f]$. 

Now, $f'(x)$ and $g'(x)$ are both positive for $x < \mu_g$, and they 
are both negative for $x > \mu_f$. Hence $h(x) = p f(x) + q g(x)$ has 
the properties that (a) $h'(x) > 0$ for $x < \mu_g$; (b) $h'(x) < 0$ for 
$x > \mu_f$; and (c) $h''(x) < 0$ for $x \in [\mu_g, \mu_f]$. From (a) and (b) it 
follows that $h$ must achieve its maximum in the interval $[\mu_g, \mu_f]$, 
and from (c) it follows that there is a unique local maximum in this 
interval. Hence setting $\varepsilon_0 = \delta$ proves (i). 

For (ii), since $f$ and $g$ must both, as density functions that are 
both centered around their respective means, go to 0 as $x$ increases 
or decreases arbitrarily, we can choose a constant $c$ large enough 
that $f(\mu_f - x) + g(x + \mu_g) < \min(p f(\mu_f), q g(\mu_g))$ for all $x > c$. 
If we then choose $\alpha > c / \min(p, q)$, we have $\mu_f - \mu > c$ and 
$\mu - \mu_g > c$, and so $h(\mu) = p f(\mu) + q g(\mu) < f(\mu) + g(\mu) < 
\min(p f(\mu_f), q g(\mu_g)) < \min(h(\mu_f), h(\mu_g))$, where the second 
inequality follows from the definition of $c$ and our choice of $\alpha$. 
Hence, $h$ is lower at its mean $\mu$ than at either of $\mu_f$ or $\mu_g$, and 
hence it must have a local minimum in the interval $[\mu_g, \mu_f]$. This 
proves (ii) with $\varepsilon_1 = c / \min(p, q)$. ■ 

Figure 5: An illustration of Theorem 5.2, using translated 
Gaussians as an example; the theorem itself does not require 
any Gaussian assumptions. 

For density functions at this level of generality, there is not much 
one can say about the unimodal shape of $h$ in part (i) of Theorem 5.1. 
However, if $f$ and $g$ are translates of the same function, 
and their next non-zero derivative at 0 is positive, then one can 
strengthen part (i) to say that the unique maximum occurs between 
the means of $h$ and $f$ when $p > \frac{1}{2}$, and between the means of $h$ and 
$g$ when $p < \frac{1}{2}$. In other words, with this assumption, one recovers 
the additional qualitative observation that for small separations between 
the functions, it is best to give scores that are slightly above 
average. We note that Gaussians are one basic example of a class 
of density functions satisfying this condition; there are also others. 
See Figure 5 for an example in which we plot the mixture when 
$f$ and $g$ are Gaussian translates, with $p$ fixed but changing $\alpha$ and 
hence changing the variance. (Again, it is not necessary to make 
a Gaussian assumption for anything we do here; the example is 
purely for the sake of concreteness.) 

Specifically, our second result is the following. In its statement, 
we use $f^{(j)}(x)$ to denote the $j$th derivative of a function $f$, and 
recall that we say a function is $C^j$ if it has at least $j$ continuous 
derivatives. 

Theorem 5.2. Suppose we have the hypotheses of Theorem 5.1, 
and additionally there is a function $k$ such that $f(x) = k(x - \mu_f)$ 
and $g(x) = k(x - \mu_g)$. (Hence $k$ is unimodal with its unique local 
maximum at $x = 0$.) 

Further, suppose that for some $j$, the function $k$ is $C^j$ and we 
have $k^{(j)}(0) > 0$ and $k^{(i)}(0) = 0$ for $2 < i < j$. Then in addition 
to the conclusions of Theorem 5.1, we also have 

(i') There exists a constant $\varepsilon_0'$ such that when $\alpha < \varepsilon_0'$, the 
combined density $h(x)$ has its unique maximum strictly between 
the mean of $f$ and the mean of $h$ when $p > \frac{1}{2}$, and strictly 
between the mean of $g$ and the mean of $h$ when $p < \frac{1}{2}$. 

Proof. We omit the proof, which applies Taylor's theorem to $k'$, 
due to space limitations. ■ 

We are, of course, not claiming that our model is the only one 
that would be consistent with the data we observed; our point is 
simply to show that there exists at least one simple model that ex- 
hibits the desired behavior. 

6. CONSISTENCY AMONG COUNTRIES 

In this section we evaluate the robustness of the observed social- 
effects phenomena by comparing review data from three additional 
different national Amazon sites: Amazon.co.uk (U.K.), Amazon.de 
(Germany) and Amazon.co.jp (Japan), collected using the same 
methodology described in Section 2, except that because of the par- 
ticulars of the AWS API, we were unable to filter out mechanically 
cross-posted reviews from the Amazon.co.jp data. It is reasonable 
to assume that these reviews were produced independently by four 
separate populations of reviewers (there exist customers who post 
reviews to multiple Amazon sites, but such behavior is unusual). 

There are noticeable differences between reviews collected from 
different regional Amazon sites, in both average helpfulness ratio 
and review variance (Table 3). The review dynamics in the U.K. 
and Japan communities appear to be less controversial than in the 
U.S. and Germany. Furthermore, repeating the analysis from Section 3 
for these three new datasets reveals the same qualitative patterns 
observed in the U.S. data and suggested by the model introduced 
in Section 5. Curiously enough, for the Japanese data, in 
contrast to Japan's general reputation as a collectivist culture [4], we 
observe that the left hump is higher than the right one for reviews with 
high variance, i.e., reviews with star ratings below the mean are 
more favored by helpfulness evaluators than the respective reviews 
with positive deviations (Figure 6). In the context of our model, 
this would correspond to a larger proportion of negative evaluators 
(balance p < 0.5). 

7. CONCLUSION 

We have seen that helpfulness evaluations on a site like Ama- 
zon.com provide a way to assess how opinions are evaluated by 
members of an on-line community at a very large scale. A review's 
perceived helpfulness depends not just on its content, but also on the 
relation of its score to other scores. This dependence on the score 
contrasts with a number of theories from sociology and social psy- 
chology, but is consistent with a simple and natural model of indi- 
vidual bias in the presence of a mixture of opinion distributions. 





           Total reviews   Avg h. ratio   Avg star rating var. 
U.S.           1,008,466           0.72                   1.34 
U.K.             127,195           0.80                   0.95 
Germany          184,705           0.74                   1.24 
Japan            253,971           0.69                   0.93 



Table 3: Comparison of review data from four regional sites: 
number of reviews with 10 or more helpfulness votes, average 
helpfulness ratio, and average variance in star rating. 

Figure 6: Signed deviations vs. helpfulness ratio for variance 
= 3, in the Japanese (left) and U.S. (right) data. The curve for 
Japan has a pronounced lean towards the left. 



There are a number of interesting directions for further research. 
First, the robustness of our results across independent populations 
suggests that the phenomenon may be relevant to other settings in 
which the evaluation of expressed opinions is a key social dynamic. 
Moreover, as we have seen in Section 6, variations in the effect 
(such as the magnitude of deviations above or below the mean) can 
be used to form hypotheses about differences in the collective be- 
haviors of the underlying populations. Finally, it would also be 
very interesting to consider social feedback mechanisms that might 
be capable of modifying the effects we observe here, and to con- 
sider the possible outcomes of such a design problem for systems 
enabling the expression and dissemination of opinions. 